Scientific Progress


The  goal  of  this  project  was  to  enable  hardware-based  security  techniques  on  future  microprocessors  in  a  way  that  they  can  be 
added  and  updated  after  fabrication,  similar  to  software,  while  maintaining  the  efficiency  and  the  security  of  hardware.  To 
achieve  this  goal,  the  project  investigated  how  to  enable  a  wide  range  of  hardware  security  techniques  in  a  programmable 
fashion or on existing platforms. The list includes cache extensions (FPL’11), instruction-grained run-time program monitors
(ICCD’11, DSN’12, ICCD’12), and steganography in Flash memory (IEEE S&P’13). The project also developed techniques to
enable run-time monitoring for real-time systems that require timely responses (DAC’12, RTAS’14, HPCA’15). Finally, the
investigation of real-time systems also led to a new performance optimization (DAC’15) and timing channel protection
(HPCA’14).

The  following  paragraphs  discuss  each  major  technical  development  in  more  detail. 

1.  FlexCache:  Extensible  Cache  Architecture  (FPL’11) 

In  today’s  microprocessors,  the  cache  architecture  is  highly  optimized  for  one  particular  design  and  cannot  be  changed  after 
fabrication.  While  allowing  efficient  implementations  in  dedicated  logic,  this  inflexibility  also  implies  that  new  techniques  cannot 
be  deployed  in  the  field. 

To solve this problem, we designed a new flexible cache architecture, named FlexCache, which uses on-chip reconfigurable
fabric  to  enable  new  extensions  to  be  added  in  the  field  after  fabrication.  The  key  idea  in  this  architecture  is  to  perform  frequent 
operations  such  as  a  cache  hit  with  dedicated  hardware,  while  allowing  less  frequent  operations  such  as  a  cache  miss  to  be 
customized  using  reconfigurable  fabric. 
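
The split between a fixed hit path and a customizable miss path can be illustrated with a minimal software sketch. It assumes a direct-mapped cache indexed by line address, and the class name, the miss_hook callback (standing in for logic on the reconfigurable fabric), and the prefetch extension are all hypothetical names chosen for illustration, not part of the FlexCache design.

```python
class FlexCacheModel:
    """Toy direct-mapped cache: fixed hit path, pluggable miss path."""

    def __init__(self, num_lines, miss_hook=None):
        self.num_lines = num_lines
        self.tags = [None] * num_lines   # one tag per cache line
        self.miss_hook = miss_hook       # hypothetical stand-in for the fabric
        self.hits = 0
        self.misses = 0

    def access(self, addr):
        index = addr % self.num_lines
        tag = addr // self.num_lines
        if self.tags[index] == tag:
            self.hits += 1               # frequent case: fixed logic only
            return True
        self.misses += 1                 # infrequent case: fill, then extension
        self.tags[index] = tag
        if self.miss_hook is not None:
            self.miss_hook(self, addr)
        return False


def next_line_prefetch(cache, addr):
    """Example extension: on a miss, also install the next line address."""
    nxt = addr + 1
    cache.tags[nxt % cache.num_lines] = nxt // cache.num_lines


plain = FlexCacheModel(8)
extended = FlexCacheModel(8, miss_hook=next_line_prefetch)
for a in range(6):                       # sequential line addresses 0..5
    plain.access(a)
    extended.access(a)
```

In this sketch the extension only runs on misses, so the hit path is identical with and without it; on the sequential access pattern above, the prefetch extension turns every other miss into a hit.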

We  evaluated  the  flexibility  and  efficiency  of  the  architecture  through  an  RTL  prototype  implementation  of  the  cache  along  with 
example  extensions  such  as  cache  performance  counters,  side-channel  protection,  prefetching,  various  replacement  policies 
and  computation  acceleration.  The  results  show  that  various  types  of  extensions  can  be  realized  on  FlexCache  with  minimal 
impact  on  performance,  power,  and  area. 

2.  Instruction-Grained  Run-Time  Program  Monitoring 

A large part of this project focused on developing an efficient and flexible run-time program monitoring framework. Run-time
monitoring is designed to check individual instructions at run-time and detect anomalies.

The project started with a previously developed architecture, named FlexCore (MICRO’10), as the baseline. FlexCore is a
hybrid  processor  architecture  where  an  on-chip  reconfigurable  fabric  (FPGA)  is  tightly  coupled  with  the  main  processing  core. 
FlexCore  supports  a  broad  range  of  run-time  monitoring  and  bookkeeping  techniques  by  programming  the  reconfigurable  fabric. 

- Precise Exception Support (ICCD’11)

Parallel monitoring frameworks such as FlexCore decouple the execution of the monitored program from the monitor in order to
minimize  performance  impacts.  Yet,  such  decoupled  architectures  lack  support  for  precise  exceptions  and  can  only  detect  an 
exception  (attack)  after  the  monitored  program  completes  an  erroneous  operation. 

In  this  project,  we  developed  a  new  architectural  mechanism  to  support  precise  exceptions  in  non-speculative  processors  with 
decoupled monitors. The main idea is to create a shadow architecture state, which allows rolling back the processor state in case of
an  exception.  Experimental  results  based  on  an  RTL  implementation  show  that  our  approach  has  low  area,  power,  and 
performance  overheads  even  when  applied  to  simple,  in-order  processors. 
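
The rollback idea can be sketched in software as follows. This is a simplified model under stated assumptions, not the mechanism implemented in the RTL: it tracks only register writes, checks them strictly in program order, and the names ShadowedCore, monitor_check, and the non_negative policy are hypothetical.

```python
from collections import deque


class ShadowedCore:
    """Toy core whose register writes run ahead of a decoupled monitor."""

    def __init__(self):
        self.regs = {}           # state the core runs ahead with
        self.shadow = {}         # last state the monitor has approved
        self.pending = deque()   # writes not yet checked by the monitor

    def execute(self, reg, value):
        self.regs[reg] = value
        self.pending.append((reg, value))

    def monitor_check(self, is_ok):
        """Check the oldest unchecked write; roll back if it is rejected."""
        reg, value = self.pending.popleft()
        if is_ok(reg, value):
            self.shadow[reg] = value      # approved: advance the shadow state
            return True
        self.regs = dict(self.shadow)     # rejected: restore approved state
        self.pending.clear()              # discard younger, unchecked writes
        return False


core = ShadowedCore()
core.execute("r1", 5)
core.execute("r2", -1)    # the hypothetical policy below rejects negatives
core.execute("r1", 7)
non_negative = lambda reg, value: value >= 0
first = core.monitor_check(non_negative)
second = core.monitor_check(non_negative)
```

After the second check fails, the core state contains only the write that the monitor approved, which is the precise-exception behavior the shadow state is meant to provide.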

-  High-Performance  Monitoring  Accelerator  (DSN’12) 

While  being  flexible,  typical  reconfigurable  fabric  such  as  FPGAs  has  a  rather  limited  clock  frequency  because  of  unpipelined 
routing.  For  example,  we  found  that  FlexCore  could  support  monitoring  for  processors  with  moderate  clock  frequencies 
(<500MHz)  with  low  overhead,  but  incurs  high  overhead  for  high-performance  processors. 

To  address  this  limitation,  we  designed  a  high-performance  monitoring  accelerator,  named  Harmoni.  The  Harmoni  architecture 
achieves  much  higher  efficiency  than  software  implementations  and  previously  proposed  monitoring  platforms  such  as 
FlexCore  by  closely  matching  the  common  characteristics  of  run-time  monitoring  functions  using  the  notion  of  tagging.  In 
essence,  the  key  idea  is  to  encode  monitoring  operations  as  tagging  operations  where  each  instruction,  register,  and  memory 
location  carries  a  tag  encoding  security  properties. 
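
As an illustration of monitoring expressed as tagging, the sketch below implements a simple taint-tracking monitor over an instruction trace. The trace format, the one-bit tag, and the function name are assumptions made for this example; Harmoni itself is a hardware design and its tag formats are not described here.

```python
def run_taint_monitor(trace, tainted_addresses):
    """Propagate a 1-bit taint tag and flag jumps through tainted registers.

    trace: list of (op, dst, srcs) tuples (hypothetical format).
    Returns the trace indices at which an alert is raised.
    """
    reg_tag = {}                                     # register -> tag
    mem_tag = {addr: 1 for addr in tainted_addresses}  # address -> tag
    alerts = []
    for i, (op, dst, srcs) in enumerate(trace):
        if op == "load":        # dst register <- memory[srcs[0]]
            reg_tag[dst] = mem_tag.get(srcs[0], 0)
        elif op == "store":     # memory[dst] <- register srcs[0]
            mem_tag[dst] = reg_tag.get(srcs[0], 0)
        elif op == "alu":       # dst register <- f(source registers)
            reg_tag[dst] = max((reg_tag.get(s, 0) for s in srcs), default=0)
        elif op == "jump":      # indirect jump through register srcs[0]
            if reg_tag.get(srcs[0], 0):
                alerts.append(i)
    return alerts


example_trace = [
    ("load", "r1", [0x100]),        # r1 receives untrusted input
    ("alu", "r2", ["r1", "r3"]),    # taint propagates to r2
    ("jump", None, ["r2"]),         # jump target derived from input: alert
    ("jump", None, ["r3"]),         # untainted target: no alert
]
alerts = run_taint_monitor(example_trace, tainted_addresses={0x100})
```

Every operation in the loop is a tag read, a tag update, or a tag check, which is the common structure that a tag-oriented accelerator can exploit.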

We  implemented  an  RTL  prototype  of  Harmoni  and  evaluated  it  with  several  example  monitoring  functions  for  security  and 
programmability.  The  prototype  demonstrates  that  the  architecture  can  support  a  wide  range  of  monitoring  functions  with 
different characteristics. Harmoni takes moderate silicon area, has very high throughput, and incurs low overhead on monitored
programs even on processors with a high frequency (2.5GHz).

-  Fast  Design  Framework  (ICCD’12) 

The parallel monitoring architecture requires each monitor to be designed as a custom hardware module. Unfortunately, the cost
and time for implementing each hardware monitor present a major obstacle in deploying the run-time monitoring techniques in
real  systems. 

This  project  addressed  the  design  complexity  problem  by  developing  a  common  architecture  framework  and  using  high-level 
synthesis.  Similar  to  customizable  processors  such  as  Tensilica  Xtensa  where  designers  only  need  to  write  a  small  piece  of 
code  that  describes  a  custom  instruction,  our  framework  enables  designers  to  only  specify  monitoring  operations.  The 
framework  provides  common  functions  such  as  collecting  a  trace  of  execution,  maintaining  meta-data,  and  interfacing  with 
software.  To  further  reduce  the  design  complexity,  we  also  explored  using  a  high-level  synthesis  tool  (Cadence  C-to-Silicon)  so 
that  hardware  monitors  can  be  described  in  a  high-level  language  (SystemC)  instead  of  in  RTL  such  as  Verilog  and  VHDL. 
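
The division of labor between the framework and a monitor can be sketched in software. The example below is an analogy in Python rather than SystemC, and all class and method names are hypothetical: a base class supplies trace delivery, metadata storage, and error reporting, and the designer writes only the monitoring operation.

```python
class MonitorFramework:
    """Shared plumbing (hypothetical): trace delivery, metadata, reporting."""

    def __init__(self):
        self.metadata = {}    # per-address metadata maintained for the monitor
        self.errors = []      # errors reported back to software

    def report(self, event, message):
        self.errors.append((event, message))

    def on_event(self, event):
        raise NotImplementedError("each monitor supplies only this method")

    def run(self, trace):
        for event in trace:
            self.on_event(event)
        return self.errors


class UninitializedMemoryCheck(MonitorFramework):
    """Designer-written part: the monitoring operation itself."""

    def on_event(self, event):
        kind, addr = event
        if kind == "store":
            self.metadata[addr] = True            # mark address initialized
        elif kind == "load" and not self.metadata.get(addr, False):
            self.report(event, "read of uninitialized memory")


errors = UninitializedMemoryCheck().run(
    [("store", 4), ("load", 4), ("load", 8)]
)
```

Here the monitor-specific code is a few lines, while the reusable part lives in the base class, mirroring the way the framework lets designers specify only monitoring operations.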

To  evaluate  our  approach,  we  implemented  a  set  of  monitors  including  soft-error  checking,  uninitialized  memory  checking, 
dynamic  information  flow  tracking,  and  array  boundary  checking  in  our  framework.  Our  results  suggest  that  our  monitor 
framework can greatly reduce the amount of code that needs to be specified for each extension, from 2105-2626 lines to 46-209
lines. The high-level synthesis was also found to be effective, achieving comparable area, performance, and power
consumption  to  handwritten  RTL. 

3.  Run-Time  Monitoring  for  Real-Time  Systems 

The  increasing  safety-critical  role  of  real-time  systems  such  as  automotive  control  systems  requires  increased  attention  to  their 
security  and  reliability.  Yet,  the  run-time  monitoring  techniques  that  we  developed  cannot  be  applied  to  hard  real-time  systems 
without  a  way  to  ensure  that  the  monitoring  will  not  lead  to  deadline  misses.  In  this  project,  we  developed  a  set  of  techniques  to 
enable  parallel  program  monitoring  for  real-time  systems. 

- Static WCET Analysis (DAC’12)

Traditional  real-time  system  designs  use  the  worst-case  execution  time  (WCET)  estimate  of  each  task  to  design  a  system  with 
timing  guarantees.  In  this  project,  we  developed  the  first  analysis  method  for  statically  determining  the  impact  of  parallel 
monitoring  on  WCET  using  a  mixed  integer  linear  programming  (MILP)  formulation.  The  key  innovation  in  this  work  was  the 
method to model a FIFO between a monitored core and a monitor and to analytically capture cases when the FIFO may be full.
We  used  our  method  to  estimate  the  WCET  for  seven  benchmark  programs  and  two  possible  monitoring  techniques.  This 
estimate  was  compared  against  observed  execution  times  from  simulation  and  a  conservative  upper  bound  based  on  sequential 
monitoring. The results showed that our method is far more accurate than the conservative bound. It estimates WCETs
within 71% of worst-case observed execution times and up to 74% lower than the sequential bound.
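
The effect that the analysis has to bound, a core stalling because the FIFO to the monitor is full, can be seen in a small cycle-level model. This sketch is a simulation of one fixed trace, not the MILP formulation; it assumes one monitoring event per instruction, and the function and parameter names are hypothetical.

```python
def monitored_core_time(core_cycles, monitor_cycles, fifo_depth):
    """Return the cycle at which the core finishes its last instruction.

    core_cycles[i]:    cycles instruction i takes on the core.
    monitor_cycles[i]: cycles the monitor needs for the event of instruction i.
    An event leaves the FIFO when the monitor starts processing it, so
    event i can only be enqueued once event i - fifo_depth has started.
    """
    start = []          # cycle at which the monitor starts each event
    core_time = 0
    monitor_free = 0    # cycle at which the monitor becomes idle
    for i, (c, m) in enumerate(zip(core_cycles, monitor_cycles)):
        ready = core_time + c
        if i >= fifo_depth:
            ready = max(ready, start[i - fifo_depth])   # stall on full FIFO
        core_time = ready
        begin = max(core_time, monitor_free)
        start.append(begin)
        monitor_free = begin + m
    return core_time


core = [1] * 6        # six single-cycle instructions
monitor = [3] * 6     # each event needs three monitor cycles
parallel = monitored_core_time(core, monitor, fifo_depth=2)
unbounded = monitored_core_time(core, monitor, fifo_depth=100)
sequential_bound = sum(core) + sum(monitor)
```

In this toy case the core finishes at cycle 10 with a two-entry FIFO, compared with 6 cycles when the FIFO never fills and 24 cycles for the sequential bound, which shows why modeling the FIFO explicitly gives a much tighter estimate than assuming sequential monitoring.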

-  Run-Time  Management  for  Hard  Real-Time  Systems  (RTAS’14) 

The WCET estimate above requires a system to be designed with enough additional slack (margin) in order to deploy run-time
monitoring. For the cases when such slack is not an option or is too expensive, we also developed a run-time mechanism
to preserve the WCET of a monitored program.

In this framework, we leverage the fact that programs typically run faster than their WCET. Monitoring is only performed when
enough  dynamic  slack  exists  in  order  to  ensure  that  the  monitoring  does  not  impact  the  timing  guarantees  of  tasks.  If  the  slack 
is  insufficient,  our  framework  drops  a  monitoring  operation  while  ensuring  that  there  is  no  impact  on  the  monitored  task  and  that 
there  is  no  false  positive  in  the  future.  For  efficient  dropping,  we  developed  a  novel  hardware  architecture  that  can  perform  this 
dropping  operation  in  a  single  cycle,  matching  the  throughput  of  the  task  being  monitored. 
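
The following sketch illustrates slack-driven dropping for an uninitialized-memory check. It is a software model under assumptions of our own choosing for this example: each monitoring operation has a fixed cost, each event reports how much dynamic slack the task has gained, and a dropped store marks its metadata as unknown so that later loads are not falsely flagged. The names and the trace format are hypothetical.

```python
def opportunistic_monitor(trace, op_cost):
    """trace: list of (kind, addr, gained_slack) tuples.

    Returns (errors, checked): flagged addresses and the number of
    monitoring operations actually performed.
    """
    initialized = set()
    unknown = set()      # metadata invalidated by a dropped operation
    errors = []
    checked = 0
    slack = 0
    for kind, addr, gained_slack in trace:
        slack += gained_slack            # task ran faster than its WCET
        if slack < op_cost:
            # Not enough slack: drop the operation. A dropped store leaves
            # stale metadata, so mark it unknown to avoid false positives.
            if kind == "store":
                unknown.add(addr)
            continue
        slack -= op_cost
        checked += 1
        if kind == "store":
            initialized.add(addr)
            unknown.discard(addr)
        elif addr not in initialized and addr not in unknown:
            errors.append(addr)
    return errors, checked


errors, checked = opportunistic_monitor(
    [("store", 1, 0),    # no slack yet: dropped, address 1 becomes unknown
     ("load", 1, 3),     # checked, but address 1 is unknown: no false alarm
     ("load", 2, 2)],    # checked, address 2 was never written: flagged
    op_cost=2,
)
```

Without the unknown set, the load from address 1 would be reported even though the program did initialize it, which is the kind of false positive the dropping mechanism has to rule out.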

Thus,  run-time  monitoring  can  be  applied  opportunistically,  with  no  impact  on  the  worst-case  execution  time  of  tasks.  Our 
experimental  results  for  three  different  monitoring  techniques  verify  that  timing  is  never  violated  and  that  false  positives  never 
occur. In addition, depending on the monitoring technique, an average monitoring coverage of 15-66% is achieved on a
multi-core platform with no impact on the worst-case execution times of tasks. With an FPGA-based monitor such as FlexCore, the
average monitoring coverage ranged from 62% to 86% depending on the monitoring technique.

-  Coverage-Overhead  Trade-off  on  Soft  Real-Time  Systems  (HPCA’15) 

The  idea  of  only  performing  a  subset  of  monitoring  can  be  leveraged  to  not  only  enable  monitoring  of  hard  real-time  systems, 
but  also  enable  the  trade-off  between  coverage  and  overhead.  With  this  intuition,  we  extended  our  framework  in  RTAS’14  with  a 
new hardware data-flow tracking engine that enables adjustable overhead through partial monitoring. This new framework
enables users to specify a desired overhead level for run-time monitoring. The dataflow engine was also extended to filter out
monitoring  operations  associated  with  null  metadata  in  order  to  reduce  overhead.  Given  this  architecture,  we  investigated  how 
the  dropping  decisions  should  be  made  for  partial  monitoring  and  show  that  there  exist  interesting  policy  decisions  depending 
on  the  target  application  of  partial  monitoring.  Our  experimental  results  show  that  overhead  can  be  reduced  significantly  by 
trading  off  coverage.  For  example,  for  monitoring  techniques  with  average  overheads  of  2-6x,  the  proposed  architecture  is  able 
to reduce overhead to 1.5x while still achieving 14-85% average coverage.
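
A simple budget-based dropping policy illustrates the coverage-overhead trade-off. This is a toy model under assumptions made for this example: monitoring cycles add directly to execution time, events with null metadata are filtered at no cost, and an operation is performed only if the normalized execution time stays within the target. The names are hypothetical, and the policies studied in the project may differ.

```python
def partial_monitor(events, target_slowdown):
    """events: list of (main_cycles, monitor_cycles, has_metadata) tuples.

    Returns (coverage, slowdown): the fraction of non-filtered monitoring
    operations performed, and execution time normalized to no monitoring.
    """
    main = 0       # cycles spent by the monitored program
    extra = 0      # cycles added by monitoring
    done = 0
    total = 0
    for main_cycles, monitor_cycles, has_metadata in events:
        main += main_cycles
        if not has_metadata:
            continue                     # filtered: null metadata, no work
        total += 1
        budget = (target_slowdown - 1) * main
        if extra + monitor_cycles <= budget:
            extra += monitor_cycles      # within budget: perform the check
            done += 1
        # otherwise the operation is dropped
    coverage = done / total if total else 1.0
    return coverage, (main + extra) / main


events = [(10, 10, True)] * 4 + [(10, 10, False)] * 2
coverage, slowdown = partial_monitor(events, target_slowdown=1.5)
```

In this example full monitoring of the four tagged events would give a slowdown of about 1.67x; with a 1.5x target, half of the operations are performed and the slowdown stays below the target.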

4.  Flash  Memory  Steganography  (IEEE  S&P  2013) 

Recently,  researchers  showed  that  analog  characteristics  of  integrated  circuits  can  be  used  to  build  new  hardware  security 
functions  such  as  device  fingerprints.  Yet,  these  techniques  require  building  new  custom  circuits. 

In  this  project,  we  found  that  some  of  the  analog  characteristics  of  off-the-shelf  Flash  memory  can  be  measured  externally 
through  a  standard  digital  interface.  From  this  observation,  we  developed  a  novel  information  hiding  technique  for  Flash 
memory  that  can  be  implemented  on  off-the-shelf  Flash  memory  chips  without  custom  circuits.  The  method  hides  data  within  an 
analog  characteristic  of  Flash,  the  program  time  of  individual  bits.  The  program  time  of  each  bit  gets  longer  as  the 
corresponding  memory  cell  wears  out,  and  we  can  change  them  individually  by  repeatedly  writing  the  memory  while  controlling 
which  value  is  written  to  each  cell.  Then,  the  relative  differences  in  the  program  time  can  be  used  to  store  bits  intentionally. 
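
The encoding idea can be illustrated with a small simulation. Everything in this sketch is an assumption made for illustration: the linear wear model, the numeric constants, and the scheme of storing one hidden bit per pair of adjacent cells are hypothetical and do not reproduce the measured behavior of real chips or the exact encoding used in the project.

```python
import random


def make_chip(num_cells, seed=0):
    """Simulated chip: per-cell baseline program time plus a wear counter."""
    rng = random.Random(seed)
    return {
        # baseline varies per cell, modelling manufacturing variation
        "base": [100 + rng.uniform(-2, 2) for _ in range(num_cells)],
        "wear": [0] * num_cells,
    }


def program_time(chip, cell):
    # Assumed model: program time grows linearly with accumulated wear.
    return chip["base"][cell] + 0.01 * chip["wear"][cell]


def hide_bits(chip, bits, stress_cycles=1000):
    """Hide bit i in cell pair (2i, 2i+1) by stressing one cell of the pair."""
    for i, bit in enumerate(bits):
        chip["wear"][2 * i + bit] += stress_cycles


def read_bits(chip, count):
    """Recover each bit by comparing program times within a pair."""
    return [
        int(program_time(chip, 2 * i + 1) > program_time(chip, 2 * i))
        for i in range(count)
    ]


chip = make_chip(8)
secret = [1, 0, 1, 1]
hide_bits(chip, secret)
recovered = read_bits(chip, len(secret))
```

In this model the stress shifts program time by 10 units while baseline variation within a pair is at most 4 units, so the comparison recovers the hidden bits; the digital contents of the cells play no role in the encoding.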

Because  the  technique  uses  analog  behaviors,  normal  Flash  memory  operations  are  not  affected  and  hidden  information  is 
invisible  in  the  data  stored  in  the  memory.  Even  if  an  attacker  checks  a  Flash  chip’s  analog  characteristics,  experimental  results 
indicate  that  the  hidden  information  is  difficult  to  distinguish  from  inherent  manufacturing  variation  or  normal  wear  on  the  device. 
Moreover,  the  hidden  data  can  survive  erasure  of  the  Flash  memory  data,  and  the  technique  can  be  used  on  current  Flash 
chips  without  hardware  changes. 

5.  Other  Work 

Our  work  on  enabling  the  run-time  monitoring  for  real-time  systems  also  led  to  a  few  new  techniques  related  to  the  timing  of  a 
system. 

- WCET Optimization (DAC’15)

Worst-case  execution  time  (WCET)  analysis  is  a  critical  part  of  designing  real-time  systems  that  require  strict  timing  guarantees. 
However,  data  caches  have  traditionally  been  challenging  to  analyze  in  the  context  of  WCET  due  to  the  unpredictability  of 
memory access patterns. We developed a new cache structure, the register-indexed cache, that is designed to be more
amenable  to  static  analysis  compared  to  traditional  caches.  This  is  based  on  the  idea  that  absolute  addresses  may  not  be 
known,  but  by  using  relative  addresses,  analysis  may  be  able  to  guarantee  a  number  of  hits  in  the  cache.  In  addition,  we 
observed  that  keeping  unpredictable  memory  accesses  in  caches  could  increase  or  decrease  WCET  depending  on  the 
application.  Thus,  we  explored  selectively  bypassing  caches  in  order  to  provide  lower  WCET.  Our  experimental  results  showed 
reductions  in  WCET  of  up  to  35%  over  the  state-of-the-art  static  analysis. 

-  Timing  Channel  Protection  for  Shared  Memory  (HPCA’14) 

We found that shared memory controllers are vulnerable to both side channel and covert channel attacks that exploit
memory  interference  as  timing  channels.  To  address  this  vulnerability,  we  designed  a  secure  memory  controller  that  enables 
secure  sharing  of  main  memory  among  mutually  mistrusting  parties  by  eliminating  memory  timing  channels.  To  eliminate  timing 
channels,  we  identified  the  sources  of  interference  in  a  conventional  memory  controller  design,  and  proposed  a  protection 
scheme to eliminate the interference across security domains through two main changes: (i) a per-security-domain
queueing structure, and (ii) static allocation of time slots in the scheduling algorithm. Multi-programmed workloads composed of
SPEC2006  benchmarks  were  used  to  evaluate  the  protection  scheme.  The  results  show  that  the  proposed  scheme  completely 
eliminates  the  timing  channels  in  the  shared  memory  with  small  hardware  and  performance  overheads. 
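
The two changes can be illustrated with a minimal scheduling model. The sketch assumes that every request takes exactly one time slot, that all requests are queued at time zero, and that slots are assigned to domains in round-robin order; the function name and request labels are hypothetical, and real DRAM timing is not modeled.

```python
from collections import deque


def static_slot_schedule(requests_per_domain, num_slots):
    """Serve per-domain queues using statically allocated time slots.

    requests_per_domain: one list of request ids per security domain.
    Slot t belongs to domain t % number_of_domains.
    Returns a dict mapping request id -> slot in which it was served.
    """
    queues = [deque(reqs) for reqs in requests_per_domain]
    completion = {}
    for slot in range(num_slots):
        owner = slot % len(queues)
        if queues[owner]:
            completion[queues[owner].popleft()] = slot
        # An unused slot stays idle; it is never given to another domain.
    return completion


idle_neighbor = static_slot_schedule([["a0", "a1"], []], num_slots=8)
busy_neighbor = static_slot_schedule(
    [["a0", "a1"], ["b0", "b1", "b2"]], num_slots=8
)
```

The service times of domain 0 are the same whether domain 1 is idle or busy, so domain 0 cannot observe, and domain 1 cannot signal through, memory contention. The cost is that idle slots are wasted, which corresponds to the performance overhead of the scheme.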

Technology  Transfer 

The  techniques  to  use  COTS  Flash  memory  for  security  functions  attracted  some  commercial  interest.  A  company  named 
CybKey Tech expressed interest in evaluating the technology. Researchers in CERT (CMU) also asked for help in replicating the
results, and we provided the necessary hardware and software.


